A Text-Mining Technique for Literature Profiling and Information Extraction from Biomedical Literature
Author
Abstract
Massive amounts of biomedical literature are readily available online in many forms. These resources contain a great deal of valuable knowledge and many relationships that need to be properly extracted, discovered, and utilized. Recognizing and classifying biomedical entity names and terms are important steps toward efficient knowledge and information extraction from these repositories. This research investigates and develops effective computational methods for literature profiling in the biomedical field. Specifically, this paper presents new techniques for biomedical term identification and classification. We apply feature selection techniques from information retrieval (IR), such as mutual information (MI) and chi-square (χ²), to select the key features for term identification and classification. We evaluated the method on the Genia 3.0 corpus, using from about 3,000 to more than 34,000 biomedical terms and entity names. The outcome of this project can be applied in various fields, including the aerospace domain, where there is great interest in discovering the relations between certain changes in the bodies of astronauts and structural changes at the levels of genes, proteins, and bindings.

Dr. Hisham Al-Mubaid

Massive amounts of biomedical literature are readily available online to researchers in many forms: text abstracts (PubMed contains over 14 million biomedical abstracts), full-text research articles, databases of protein interactions, dictionaries of gene and protein names, and much more. Huge amounts of valuable knowledge and useful information are embedded in these resources, waiting to be properly extracted, discovered, and utilized. There is a great need for computational techniques that extract and utilize the useful knowledge in these resources. A number of systems and software tools have been developed to exploit these overwhelming resources.
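The feature selection step named in the abstract (MI and χ²) can be illustrated with a small sketch. The source does not give the exact formulas or the feature representation, so the following is an assumption: each training example is the set of context words around a candidate term, MI is computed in its pointwise form, and χ² uses the standard 2×2 contingency table. The toy data and all function names are hypothetical.

```python
import math

# Hypothetical toy data: the context words around six candidate terms,
# each labelled with the class of the term they surround.
CONTEXTS = [
    {"binds", "receptor", "the"},
    {"kinase", "binds", "the"},
    {"expression", "binds", "of"},
    {"the", "of", "study"},
    {"patients", "the", "study"},
    {"of", "patients", "results"},
]
LABELS = ["protein", "protein", "protein", "other", "other", "other"]


def contingency(contexts, labels, feature, target):
    """Count the 2x2 table for one feature and one target class."""
    n11 = n10 = n01 = n00 = 0
    for words, label in zip(contexts, labels):
        has_feature = feature in words
        in_class = label == target
        if has_feature and in_class:
            n11 += 1
        elif has_feature:
            n10 += 1
        elif in_class:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table (0.0 if a margin is empty)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def mutual_information(n11, n10, n01, n00):
    """Pointwise MI in bits: log2 of P(t, c) / (P(t) * P(c))."""
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")
    return math.log2(n11 * n / ((n11 + n10) * (n11 + n01)))


def top_features(contexts, labels, target, score, k):
    """Return the k best-scoring features, ties broken alphabetically."""
    vocab = set().union(*contexts)
    scored = {w: score(*contingency(contexts, labels, w, target)) for w in vocab}
    return sorted(vocab, key=lambda w: (-scored[w], w))[:k]


print(top_features(CONTEXTS, LABELS, "protein", chi_square, 3))
print(top_features(CONTEXTS, LABELS, "protein", mutual_information, 3))
```

In this toy set, "binds" occurs in every protein context and in no other context, so it receives the highest χ² score, while "the" is spread evenly over both classes and scores zero. Pointwise MI tends to favor rare features, which is one reason the two measures can rank features differently.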
Biomedical research has shown that text mining can be effective in this field, making text mining increasingly important and necessary for biology and medicine. The purpose of this research is to investigate and design an effective computational method for literature profiling that extracts and organizes important information and relationships from biomedical literature. To that end, we implemented new methods to identify and classify technical terms and entity names in biomedical texts. The methods are based on machine learning and can be viewed as a word classification task. We used feature selection techniques such as mutual information (MI) and chi-square (χ²) to select the key features in the contexts of the terms of interest. The methods were evaluated extensively in a large number of experiments.

The outcome of this project can be applied in various fields, including the aerospace domain. In aerospace, there is great interest in discovering the relations between certain changes in the bodies of astronauts (due to radiation, reduced gravity, and isolation) and structural changes at the levels of genes, proteins, and bindings. Moreover, certain symptoms need to be explained at the gene or protein level so that their consequences and future complications can be anticipated and treated in a timely manner.

Related Work

In the biomedical domain, most term identification and recognition techniques target specific entities and terms (mostly gene and protein names); in this way, term identification and term classification are integrated into one task. A number of machine learning and statistical approaches have been proposed for term identification and classification. For example, Morgan et al. used HMMs based on local context and simple orthographic and case variations, and reported an F-measure of 75% for the recognition of Drosophila gene names. Shen et al. used POS tags and noun heads as features and achieved F-scores of 16.7% to 80%, depending on the class, and reported that POS tags proved to be among the most useful features.

A number of approaches employed SVMs for term identification and recognition. For example, Kazama et al. used SVMs for multi-class classification. They annotated the training data with B, I, and O class labels to indicate that a word is at the beginning of, inside, or outside a term. The JNLPBA-2004 competition included eight systems for the bio-entity recognition task. The competition was an open challenge, and the participants were allowed to use whatever techniques and data resources they preferred. However, the systems were evaluated using a common evaluation methodology and a common dataset. Four types of classification models were used: SVMs, HMMs, MEMMs, and CRFs. The overall results (Table 1) show recall ranging from 50.8% to 76.0%, precision from 43.6% to 69.4%, and F-score from 47.7% to 72.6%.

Table 1. Results of the JNLPBA-2004 bio-entity recognition competition: recall/precision/F-score of each participating system and the baseline (BL), taken from Kim et al. (2004).

System    1978–1989 set     1990–1999 set     2000–2001 set     S/1998–2001 set   Total
[Zho04]   75.3/69.5/72.3    77.1/69.2/72.9    75.6/71.3/73.8    75.8/69.5/72.5    76.0/69.4/72.6
[Fin04]   66.9/70.4/68.6    73.8/69.4/71.5    72.6/69.3/70.9    71.8/67.5/69.6    71.6/68.6/70.1
[Set04]   63.6/71.4/67.3    72.2/68.7/70.4    71.3/69.6/70.5    71.3/68.8/70.1    70.3/69.3/69.8
[Son04]   60.3/66.2/63.1    71.2/65.6/68.2    69.5/65.8/67.6    68.3/64.0/66.1    67.8/64.8/66.3
[Zha04]   63.2/60.4/61.8    72.5/62.6/67.2    69.1/60.2/64.7    69.2/60.3/64.4    69.1/61.0/64.8
[Rös04]   59.2/60.3/59.8    70.3/61.8/65.8    68.4/61.5/64.8    68.3/60.4/64.1    67.4/61.0/64.0
[Par04]   62.8/55.9/59.2    70.3/61.4/65.6    65.1/60.4/62.7    65.9/59.7/62.7    66.5/59.8/63.0
[Lee04]   42.5/42.0/42.2    52.5/49.1/50.8    53.8/50.9/52.3    52.3/48.1/50.1    50.8/47.6/49.1
BL        47.1/33.9/39.4    56.8/45.5/50.5    51.7/46.3/48.8    52.6/46.0/49.1    52.6/43.6/47.7

(NASA/UHCL/UH-ISSO, p. 45)
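The B/I/O labeling scheme and the recall/precision/F-score figures discussed in the related work can be made concrete with a minimal sketch. It rests on several assumptions that the text does not state: labels carry a class suffix (for example `B-protein`), a predicted term counts as correct only if its boundaries and class match the gold term exactly, and the F-score is the harmonic mean of recall and precision. The function names and the example sentence labels are hypothetical.

```python
def bio_to_spans(labels):
    """Convert B-X / I-X / O labels to a set of (start, end, class) spans.

    `end` is exclusive. An I-X label that does not continue an open span
    of the same class is treated as the start of a new span (an assumption).
    """
    spans = set()
    start, kind = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last word.
    for i, label in enumerate(list(labels) + ["O"]):
        if label.startswith("I-") and start is not None and kind == label[2:]:
            continue  # still inside the current term
        if start is not None:
            spans.add((start, i, kind))
            start, kind = None, None
        if label.startswith(("B-", "I-")):
            start, kind = i, label[2:]
    return spans


def f_score(recall, precision):
    """Harmonic mean of recall and precision (0.0 if both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def recall_precision_f(gold, predicted):
    """Exact-match scores, returned in the order used in Table 1."""
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision, f_score(recall, precision)


# Hypothetical gold and predicted labels for one six-word sentence.
GOLD = ["O", "B-protein", "I-protein", "O", "B-DNA", "O"]
PRED = ["O", "B-protein", "I-protein", "O", "O", "B-DNA"]

print(bio_to_spans(GOLD))
print(recall_precision_f(bio_to_spans(GOLD), bio_to_spans(PRED)))

# Consistency check against the Total column of Table 1 for [Zho04]:
# recall 76.0 and precision 69.4 give an F-score that rounds to 72.6.
print(round(f_score(76.0, 69.4), 1))
```

In the toy sentence the protein term is found exactly, but the DNA term is predicted one word too late, so it counts as both a miss and a false alarm under exact matching.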